Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM
Authors
Abstract
Bitwise operations are an important component of modern-day programming. Many widely used data structures (e.g., bitmap indices in databases) rely on fast bitwise operations on large bit vectors to achieve high performance. Unfortunately, in existing systems, regardless of the underlying architecture (e.g., CPU, GPU, FPGA), the throughput of such bulk bitwise operations is limited by the available memory bandwidth. We propose Buddy, a new mechanism that exploits the analog operation of DRAM to perform bulk bitwise operations completely inside the DRAM chip, thereby not wasting any memory bandwidth. Buddy consists of two components. First, simultaneous activation of three DRAM rows that are connected to the same set of sense amplifiers enables us to perform bitwise AND and OR operations. Second, the inverters present in each sense amplifier enable us to perform bitwise NOT operations, with modest changes to the DRAM array. These two components, along with RowClone, a prior proposal for fast row copying inside DRAM, make Buddy functionally complete, thereby allowing it to perform any bitwise operation efficiently inside DRAM. Our implementation of Buddy largely exploits the existing DRAM structure and interface, and incurs low overhead (1% of DRAM chip area). Our evaluations based on SPICE simulations show that, across seven commonly used bitwise operations, Buddy provides a 10.9X to 25.6X improvement in raw throughput and a 25.1X to 59.5X reduction in energy consumption. We evaluate three real-world data-intensive applications that exploit bitwise operations. First, Buddy improves performance of database queries that use bitmap indices for fast analytics by 6.0X compared to a state-of-the-art baseline using SIMD operations. Second, Buddy accelerates BitWeaving, a recently proposed technique for fast database scans, by 7.0X on average across a wide range of scan parameters. Third, for the commonly used set data structure, Buddy improves performance of set intersection, union, and difference operations by 3.0X compared to conventional implementations. We also describe four other promising applications that can benefit from Buddy, including DNA sequence analysis, encryption, and approximate statistics. We believe and hope that the large performance and energy improvements provided by Buddy can enable many other applications to use bitwise operations.
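The two components named in the abstract can be illustrated with a small software model. The sketch below is not the authors' implementation. It assumes that activating three rows at once leaves each bit position holding the majority of the three stored values, and that a control row of all zeros or all ones then selects AND or OR. The abstract itself does not spell out this majority model, so treat it as an assumption here. Rows are modeled as Python integers, and `ROW_BITS`, `maj3`, and the `buddy_*` helpers are hypothetical names chosen for illustration.

```python
# Toy software model of triple-row activation (assumption: the result is a
# bitwise majority of the three activated rows). Not a DRAM simulator.

ROW_BITS = 16                      # hypothetical row width for the example
ALL_ZEROS = 0                      # control row initialized to all 0s
ALL_ONES = (1 << ROW_BITS) - 1     # control row initialized to all 1s


def maj3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three equal-width rows."""
    return (a & b) | (b & c) | (a & c)


def buddy_and(a: int, b: int) -> int:
    """AND: majority with a control row of all zeros."""
    return maj3(a, b, ALL_ZEROS)


def buddy_or(a: int, b: int) -> int:
    """OR: majority with a control row of all ones."""
    return maj3(a, b, ALL_ONES)


def buddy_not(a: int) -> int:
    """NOT: models the inverter path, masked to the row width."""
    return ~a & ALL_ONES


def buddy_xor(a: int, b: int) -> int:
    """XOR built only from AND, OR, and NOT (functional completeness)."""
    return buddy_or(buddy_and(a, buddy_not(b)), buddy_and(buddy_not(a), b))


if __name__ == "__main__":
    a, b = 0b1100_1010_0000_1111, 0b1010_0110_1111_0000
    print(format(buddy_and(a, b), "016b"))
    print(format(buddy_or(a, b), "016b"))
    print(format(buddy_xor(a, b), "016b"))
```

Because AND, OR, and NOT together are functionally complete, any other bitwise operation (such as the XOR above) can be composed from them, which is the property the abstract relies on.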
Similar resources
The Processing Using Memory Paradigm: In-DRAM Bulk Copy, Initialization, Bitwise AND and OR
In existing systems, the off-chip memory interface allows the memory controller to perform only read or write operations. Therefore, to perform any operation, the processor must first read the source data and then write the result back to memory after performing the operation. This approach consumes high latency, bandwidth, and energy for operations that work on a large amount of data. Several ...
Bitwise Operations under RAMBO
In this paper we study the problem of computing w-bit bitwise operations using only O(1) memory probes. We show that under the RAM model there exists a Ω(2) space lower bound while under the RAMBO model this space bound goes down to O(w) bits. We present algorithms that use four different RAMBO memory topologies to perform bitwise boolean operations and shift operations.
STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement
High capacity Last Level Cache (LLC) architectures have been proposed to mitigate the widening processor-memory speed gap. These LLC architectures have been realized using DRAM or Spin-Transfer-Torque Random Access Memory (STT-RAM) memory technologies. It has been shown that STT-RAM LLC provides improved energy efficiency compared to DRAM LLC. However, existing STT-RAM LLC suffers from increased...
Understanding and Improving the Latency of DRAM-Based Memory Systems
Over the past two decades, the storage capacity and access bandwidth of main memory have improved tremendously, by 128x and 20x, respectively. These improvements are mainly due to the continuous technology scaling of DRAM (dynamic random-access memory), which has been used as the physical substrate for main memory. In stark contrast with capacity and bandwidth, DRAM latency has remained almost ...
Couture: Tailoring STT-MRAM for Persistent Main Memory
Modern computer systems rely extensively on dynamic random-access memory (DRAM) to bridge the performance gap between on-chip cache and secondary storage. However, continuous process scaling has exposed DRAM to high off-state leakage and excessive power consumption from frequent refresh operations. Spin-transfer torque magnetoresistive RAM (STT-MRAM) is a plausible replacement for DRAM, given it...
Journal title:
- CoRR
Volume abs/1611.09988
Pages -
Publication date: 2016